Reference phylogenies are crucial for providing a taxonomic framework for interpretation of marker 
gene and metagenomic surveys, which continue to reveal novel species at a remarkable rate. 
Greengenes is a dedicated full-length 16S rRNA gene database that provides users with a curated 
taxonomy based on de novo tree inference. We developed a 'taxonomy to tree' approach for transferring 
group names from an existing taxonomy to a tree topology, and used it to apply the Greengenes, 
National Center for Biotechnology Information (NCBI) and cyanoDB (Cyanobacteria only) taxonomies 
to a de novo tree comprising 408 135 sequences. We also incorporated explicit rank information 
provided by the NCBI taxonomy to group names (by prefixing rank designations) for better user 
orientation and classification consistency. The resulting merged taxonomy improved the classifica- 
tion of 75% of the sequences by one or more ranks relative to the original NCBI taxonomy with the 
most pronounced improvements occurring in under-classified environmental sequences. We also 
assessed candidate phyla (divisions) currently defined by NCBI and present recommendations for 
consolidation of 34 redundantly named groups. All intermediate results from the pipeline, which 
includes tree inference, jackknifing and transfer of a donor taxonomy to a recipient tree (tax2tree), are 
available for download. The improved Greengenes taxonomy should provide important infrastructure 
for a wide range of megasequencing projects studying ecosystems on scales ranging from our 
own bodies (the Human Microbiome Project) to the entire planet (the Earth Microbiome Project). 
The implementation of the software can be obtained from http://sourceforge.net/projects/tax2tree/. 
Introduction 

A robust universal reference taxonomy is a necessary 
aid to interpretation of high-throughput sequence 
data from microbial communities (Tringe and 
Hugenholtz, 2008). Taxonomy based on the 16S 
rRNA gene (16S) is the most comprehensive and 
widely used in microbiology today (Pruesse et al., 
2007; Peplies et al., 2008), but has yet to reach its 
full potential because numerous microbes belong 
to taxa that have not yet been characterized and 
because numerous sequences that could be reliably 
classified remain unannotated. For example, 
two-thirds of 16S sequences in GenBank are classified 
only to domain (kingdom), that is, Archaea or 
Bacteria, by the National Center for Biotechnology 
Information (NCBI) taxonomy. This taxonomy is 
likely the most widely consulted 16S-based 
taxonomy, despite its disclaimer that it is not an 
authoritative source, in part because classifications 
are kept up to date through user submissions. 
Most of the un(der)classified sequences are from 
culture-independent environmental surveys; these 
sequences can swamp BLAST searches, leaving 
users baffled about the phylogenetic affiliation of 
their submitted sequences. This shortcoming has 
been addressed by several dedicated 16S databases, 
including the Ribosomal Database Project (Cole 
et al., 2009), Greengenes (DeSantis et al., 2006), 
SILVA (Pruesse et al., 2007) and EzTaxon (Chun 
et al., 2007), that classify a higher proportion of 
environmental sequences. However, improvements 
are still needed because many sequences remain 
unclassified and numerous classification conflicts 
exist between the different 16S databases (DeSantis 
et al., 2006). Moreover, the emergence of large-scale 
next-generation sequencing projects such as the 
Human Microbiome Project (Turnbaugh et al., 
2007; Peterson et al., 2009) and TerraGenome (Vogel 
et al., 2009), and the availability of affordable 
sequencing to a wide range of users who have 
traditionally lacked access, mean that the need to 
integrate new sequences into a consistent universal 
taxonomic framework has never been greater. 

The Greengenes taxonomy is currently based on a 
de novo phylogenetic tree of 408 135 quality-filtered 
sequences calculated using FastTree (Price et al., 
2010). De novo tree construction is among the most 
objective means for inferring sequence relationships, 
but requires either generation of new taxonomic 
classifications or transfer of existing taxonomic 
classifications between iterations of trees as the 
16S database expands. Previously, we developed a 
tool that automatically assigns names to mono- 
phyletic groups in large phylogenetic trees (Dalevi 
et al., 2007), which is useful for naming novel 
(unclassified) clusters of environmental sequences. 
Here we describe a method to transfer group names 
from any existing taxonomy to any tree topology that 
has overlapping terminal node (tip) names. We used 
this 'taxonomy to tree' approach to annotate the 
408 135 sequence tree with the NCBI taxonomy as 
downloaded in June 2010 (Sayers et al., 2011), supple- 
mented with the Greengenes taxonomy from the 
previous iteration (Dalevi et al., 2007) and cyanoDB 
(http://www.cyanodb.cz). Explicit rank information, 
prefixed to group names, was incorporated into the 
Greengenes taxonomy to help users orient themselves 
and to improve the consistency of the classification. 
We assessed the consistency of the resulting classi- 
fication with the NCBI taxonomy including currently 
defined candidate phyla (divisions), and present 
recommendations for consolidation of 34 redun- 
dantly named groups and exclusion of one on the 
basis that its sole representative is chimeric. 



Materials and methods 

16S data compilation and de novo tree inference 
We obtained 16S sequences from the Greengenes 
database, which extracts these sequences from 
public databases using quality filters as described 
previously (DeSantis et al., 2006). We only used 
sequences that had <1% non-ACGT characters. 
The sequences were checked for chimeras using 
UCHIME (http://www.drive5.com/uchime/) and 
ChimeraSlayer (Haas et al., 2011). We only removed 
sequences from named isolates if they were classi- 
fied as chimeric by both tools; we removed other 
sequences if they were classified as chimeric by 
either tool or if they were unique to one study, 
meaning that no similar sequence (within 3% in a 
preliminary tree) was reported by another study. 
Quality filtered 16S sequences were aligned based 
on both primary sequence and secondary structure 
to archaeal and bacterial covariance models (ssu- 
align-0.1) using Infernal (Nawrocki et al., 2009) with 
the sub option to avoid alignment errors near the 
ends. The models were built from structure-annotated 
training alignments derived from the Comparative 
RNA Website (Cannone et al., 2002) as 
described in detail previously (Nawrocki et al., 
2009). The resulting alignments were adjusted to 
fit the fixed 7682 character Greengenes alignment 
through identification of corresponding positions 
between the model training alignments and the 
Greengenes alignment. Hypervariable regions were 
filtered using a modified version of the Lane mask 
(Lane, 1991). A tree of the remaining 408 135 filtered 
sequences (tree_16S_all_gg_2011_1) was built using 
FastTree v2.1.1, a fast and accurate approximately 
maximum-likelihood method, with the CAT 
approximation; branch lengths were rescaled using a 
gamma model (Price et al., 2010). Statistical support 
for taxon groupings in this tree was conservatively 
approximated using taxon jackknifing, in which a 
fraction (0.1%) of the sequences (rather than align- 
ment positions) is excluded at random and the tree 
reconstructed. We use these support values to help 
guide selection of monophyletic interior nodes for 
group naming during manual curation. 
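
The filtering rules described above can be sketched in a few lines of Python. This is only an illustration of one reading of the text, not the Greengenes pipeline itself: the `Candidate` record, its field names and the function names are hypothetical, and the chimera calls are assumed to have been computed beforehand by UCHIME and ChimeraSlayer.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical per-sequence record; field names are illustrative only."""
    seq: str
    named_isolate: bool
    uchime_chimeric: bool        # call made beforehand by UCHIME
    slayer_chimeric: bool        # call made beforehand by ChimeraSlayer
    unique_to_one_study: bool    # no similar sequence reported by another study


def non_acgt_fraction(seq):
    """Fraction of characters that are not A, C, G or T."""
    if not seq:
        return 1.0
    bad = sum(1 for base in seq.upper() if base not in "ACGT")
    return bad / len(seq)


def keep_sequence(rec, max_non_acgt=0.01):
    """Apply the quality and chimera rules as read from the text."""
    # Only sequences with <1% non-ACGT characters are used.
    if non_acgt_fraction(rec.seq) >= max_non_acgt:
        return False
    # Named isolates are removed only when both tools call them chimeric.
    if rec.named_isolate:
        return not (rec.uchime_chimeric and rec.slayer_chimeric)
    # Other sequences are removed on either call, or if unique to one study.
    return not (rec.uchime_chimeric or rec.slayer_chimeric
                or rec.unique_to_one_study)


if __name__ == "__main__":
    isolate = Candidate("ACGT" * 50, True, True, False, False)
    clone = Candidate("ACGT" * 50, False, True, False, False)
    print(keep_sequence(isolate), keep_sequence(clone))
```

The asymmetry is deliberate in this reading: named isolates are harder to replace than environmental clones, so they are discarded only on agreement between both chimera checkers.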

For evaluation of NCBI-defined candidate phyla, 
we added 765 mostly partial length sequences, that 
failed the Greengenes filtering procedure but were 
required for the evaluation, to the alignment using 
PyNAST (Caporaso et al., 2010; based on the 29 
November, 2010 Greengenes OTU templates) and 
generated a second FastTree tree 
(tree_16S_candiv_gg_2011_1) using the parameters described above. 



Transferring taxonomies to trees (tax2tree) 
Having constructed de novo trees, the next key step 
was to link the internal tree nodes to known, named 
taxa. We used the NCBI taxonomy (Sayers et al., 
2011) as the primary taxonomic source to annotate 
the trees, supplemented by the previous iteration of 
the Greengenes taxonomy (Dalevi et al., 2007) and 
cyanoDB (http://www.cyanodb.cz). This taxonomic 
annotation used a new algorithm called tax2tree. 
Briefly, tax2tree consists of the following steps: 

(1) Input consists of a flat file containing the donor 
taxonomy and an unannotated (no group names) 
Newick-format recipient phylogenetic tree with 
common sequence (tip) identifiers. The tax2tree 
donor taxonomy is in a very simple format 
(Supplementary Figure S1) comprising a unique 
ID followed by a taxonomy string with rank 
prefixes and was derived from the NCBI taxon- 
omy (ftp://ftp.ncbi.nih.gov/pub/taxonomy/) using 
a custom Python script. The tax2tree algorithm 
first filters out non-informative taxonomic 
assignments from the donor taxonomy strings, 
such as those including the words 'environmental' 
or 'unclassified', as described previously (Dalevi et al., 
2007). After filtering, the remaining assignments 
at each taxonomic level from domain to species 
are added to each tip with empty placeholders at 
levels that are missing taxon names. The result of 
this phase is a tree in which some of the tips 
have taxonomic information at some or all ranks, 
imported directly from a donor taxonomy. In 
addition, each node in the phylogenetic tree is 
augmented with a tip start and stop index 
corresponding to a list that contains tip taxon- 
omy information. This structure allows for rapid 
lookups of the taxon names present at all of the 
tips that descend from any internal node. 

(2) Precision and recall values necessary for the 
F-measure calculation (see step 4) are calculated 
and stored on the internal nodes. This caching 
markedly improves performance on large trees. 

(3) For each taxonomic rank at each internal node, 
tax2tree determines whether it is safe to hold a name. 
A taxonomic rank is considered safe if (a) there 
exists a name at that taxonomic rank that is 
represented on ≥50% of the tips that descend 
from the node and (b) the parent taxonomic rank (for 
example, phylum for class) is also safe. These 
names are decorated onto the tree at this point, 
resulting in a phylogenetic tree containing many 
duplicate names on internal nodes. 

(4) The F-measure [F = 2 × (precision × recall)/(precision 
+ recall)] (van Rijsbergen, 1979) is then 
calculated for each internal node for each name 
at each taxonomic rank in order to determine the 
optimal internal node for a name. The F-measure 
is defined as the harmonic mean of precision and 
recall, and balances false positives and false 
negatives. Precision is the fraction of informative 
tips under a given node that carry a given name; 
recall is the fraction of all informative tips in the 
entire tree carrying that name that descend from 
the given node. Node references and F-measure 
scores are then cached in a two-dimensional 
Python dictionary external to the tree, keyed by 
both the rank level and the taxon name. After 
all nodes are scored, the dictionary is iterated 
over and, for each unique name, the internal 
node with the highest F-measure score is 
retained. Each name is saved on only one 
internal node; all other 
internal nodes with that name are stripped of 
the name. If a tie is encountered, the internal 
node with the fewest tips is kept. The result of 
this phase is that the phylogenetic tree contains 
many names on the internal nodes, with each 
name occurring at most once. Gaps in the 
taxon names decorated onto the tree are likely 
the result of polyphyletic groups. 

(5) Backfilling is used to fill taxonomic gaps in the 
unique taxon names left on the phylogenetic tree 
from the F-measure process. For this procedure, 
the input taxonomy is transformed into a tree 
and a Python dictionary is constructed that is 
keyed by the taxon name and valued by its 
corresponding node. A gap is defined as missing 
taxonomy rank name information in the phylo- 
genetic tree between a named interior node and 
its nearest named ancestor (for example, having 
phylum and order names but without a name for 
the intervening class rank). For each internal 
node and nearest named ancestor pair in the 
phylogenetic tree in which a gap occurs, the 
taxon name of the node farthest from the root is 
identified in the input taxonomy. The input 
taxonomy is then traversed toward its root until the 
nearest ancestor's taxon name is found. The names of the nodes 
traversed in the input taxonomy tree are then 
appended to the node farthest from the root of 
the phylogenetic tree. Following the backfilling 
procedure, it is possible for the phylogenetic tree 
to have duplicate taxon names. 

(6) A back-propagation procedure identifies redun- 
dant taxon names in the phylogenetic tree that 
can be collapsed into a single clade. Here, we test 
whether any internal node has nearest named 
descendants at a given rank (for example, 
phylum) that all share the same name. If so, the 
name can be removed from the descendants and 
propagated to the internal phylogenetic node 
being interrogated. 
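
The F-measure selection of step 4 and the backfilling of step 5 can be illustrated with a minimal Python sketch. It is not the tax2tree implementation: the nested-tuple tree, the tip names t1 to t8, the genus names Alpha and Beta and all function names are hypothetical, and the 'safe name' test of step 3, the caching of step 2 and the tip-index lookups of step 1 are omitted for brevity.

```python
from collections import Counter

# Hypothetical toy tree as nested tuples of tip names (not from the paper).
TREE = ((("t1", "t2"), ("t3", "t4")), (("t5", "t6"), ("t7", "t8")))
# Genus per tip; None marks an uninformative tip, and t8 is a stray Alpha.
GENUS = {"t1": "Alpha", "t2": "Alpha", "t3": "Alpha", "t4": None,
         "t5": "Beta", "t6": "Beta", "t7": "Beta", "t8": "Alpha"}


def internal_nodes(tree):
    """Return one frozenset of descendant tip names per internal node."""
    found = []

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(walk(child) for child in node))
        found.append(tips)
        return tips

    walk(tree)
    return found


def f_measure(tips, name, tip_names, total_with_name):
    """Harmonic mean of precision and recall for `name` at one node."""
    informative = [t for t in tips if tip_names.get(t) is not None]
    hits = sum(1 for t in informative if tip_names[t] == name)
    if hits == 0:
        return 0.0
    precision = hits / len(informative)
    recall = hits / total_with_name
    return 2 * precision * recall / (precision + recall)


def best_nodes(tree, tip_names):
    """Step 4: keep, for each name, the node with the highest F-measure."""
    nodes = internal_nodes(tree)
    totals = Counter(n for n in tip_names.values() if n is not None)
    best = {}
    for name, total in totals.items():
        # Ties are broken in favour of the node with the fewest tips.
        best[name] = max(
            nodes,
            key=lambda tips: (f_measure(tips, name, tip_names, total),
                              -len(tips)),
        )
    return best


def backfill(node_name, ancestor_name, parent_of):
    """Step 5: names missing between a node and its nearest named ancestor.

    `parent_of` maps each donor-taxonomy name to its parent name. Returning
    an empty list when the ancestor is never reached is an assumption.
    """
    missing = []
    current = parent_of.get(node_name)
    while current is not None and current != ancestor_name:
        missing.append(current)
        current = parent_of.get(current)
    if current is None:
        return []
    return missing[::-1]


if __name__ == "__main__":
    for name, tips in best_nodes(TREE, GENUS).items():
        print(name, sorted(tips))
    donor = {"s_Clostridium bolteae": "g_Clostridium",
             "g_Clostridium": "f_Lachnospiraceae"}
    print(backfill("s_Clostridium bolteae", "f_Lachnospiraceae", donor))
```

In this toy tree the uninformative tip t4 sits under the node chosen for Alpha and so would inherit that name, while the stray Alpha tip t8 lowers precision at the Beta node without preventing Beta from being placed there.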

Secondary taxonomies were then applied manually 
to the annotated recipient tree, 
tree_16S_candiv_gg_2011_1, in ARB (Ludwig et al., 2004) using the group 
tool. For the Greengenes taxonomy, this was achieved 
by displaying the Greengenes taxonomy field at the 
tips of the tree and assigning group names missed 
in the automated taxonomy transfer (mostly higher 
level ranks associated with candidate phyla). For the 
cyanoDB taxonomy, manual assignments were based 
on type species (http://www.cyanodb.cz/valid_genera; 
405 instances in tree_16S_candiv_gg_2011_1; 
Supplementary Table S1). The manually supplemented 
taxonomic assignments were then exported from the 
curated tree as a flat file (Supplementary Figure S1) 
using functionality in the tax2tree pipeline and 
reapplied to tree_16S_all_gg_2011_1, ensuring manual 
updates were efficiently propagated to both trees. 
The tax2tree software is implemented in the 
Python programming language using the PyCogent 
toolkit (Knight et al., 2007), and is available under 
the open-source General Public License at 
http://sourceforge.net/projects/tax2tree/. 

Taxonomy comparisons 

The NCBI and Greengenes taxonomies were compared 
for each of the 408 135 sequences in 
tree_16S_all_gg_2011_1, making use of the explicit rank 
designations. The lowest classified rank for each 
sequence was determined and compared (Figure 2a) 
to estimate overall improvements in classification. 
Note that only contiguous classifications were used 
in this estimate, that is, all ranks leading to the 
lowest named rank also had to have names. 
Taxonomic similarities and differences for each 
sequence at each rank were also assessed by 
dividing sequences into five categories: (i) the two 
taxonomies had equal values (same name) at a given 
taxonomic rank; (ii) the two taxonomies had 
unequal values at a given rank; (iii) and (iv) only one of 
the taxonomies lacked a value at a given rank (one 
category per taxonomy); and 
(v) both taxonomies lacked a value at a given rank. 
This provided an indication of the type of changes 
that had occurred between the NCBI and Greengenes 
taxonomies (Figure 2b). 
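
The two comparisons described above can be sketched as follows. This is an illustrative reading rather than the code used for Figure 2: the function names, the category labels and the one-letter rank prefixes (including 'k' for domain/kingdom) are assumptions made for the example.

```python
# One-letter rank prefixes from domain/kingdom to species (assumed here).
RANKS = ["k", "p", "c", "o", "f", "g", "s"]


def parse_taxonomy(tax_string):
    """Turn 'k_Bacteria; p_Firmicutes; c_' into a rank -> name dict.

    Ranks with no name (or absent from the string) map to None.
    """
    names = dict.fromkeys(RANKS)
    for field in tax_string.split(";"):
        prefix, _, name = field.strip().partition("_")
        name = name.lstrip("_").strip()
        if prefix in names and name:
            names[prefix] = name
    return names


def lowest_contiguous_rank(names):
    """Lowest named rank such that every rank above it is also named."""
    lowest = None
    for rank in RANKS:
        if names[rank] is None:
            break
        lowest = rank
    return lowest


def compare_at_rank(first, second, rank):
    """Assign one of the five categories to a pair of classifications."""
    a, b = first[rank], second[rank]
    if a is None and b is None:
        return "both_missing"
    if a is None:
        return "first_missing"
    if b is None:
        return "second_missing"
    return "same" if a == b else "different"


if __name__ == "__main__":
    ncbi = parse_taxonomy("k_Bacteria")
    gg = parse_taxonomy("k_Bacteria; p_Firmicutes; c_Clostridia; "
                        "o_Clostridiales; f_Lachnospiraceae; g_; s_")
    print(lowest_contiguous_rank(ncbi), lowest_contiguous_rank(gg))
    for rank in RANKS:
        print(rank, compare_at_rank(ncbi, gg, rank))
```

Stopping at the first unnamed rank implements the contiguity requirement: a sequence with domain and class names but no phylum name is counted as classified to domain only.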

Results and discussion 

The rapid accumulation of sequence data from 
across the tree of life is a boon for molecular 
taxonomy, but also presents a major barrier to 
sequence-based taxonomy curators as it is essen- 
tially impossible to manually curate trees compris- 
ing hundreds of thousands of sequences from 
scratch. We overcame this difficulty by developing 
an automated procedure based on F-measures (van 
Rijsbergen, 1979) for transferring any (donor) taxonomy 
(in a standard flat text format, Supplementary 
Figure S1) to any unannotated (recipient) tree 
(in Newick format) given common sequence identifiers. 
The F-measure is most often used to measure 
the classification performance (precision and recall) 
of information retrieval processes such as database 
searches (van Rijsbergen, 1979). This approach also 
has the potential to provide an assessment of the 
quality of fit between a taxonomy and a tree, which 
could be used to screen multiple taxonomies and/or 
trees before manual curation efforts. 

Construction of the rank-explicit Greengenes taxonomy 
Using quality-filtered sequences from Greengenes 
(DeSantis et al., 2006) aligned with the secondary 
structure-aware Infernal package (Nawrocki et al., 
2009), we constructed a phylogenetic tree using 
FastTree2 (Price et al., 2010) containing 408 135 
sequences (tree_16S_all_gg_2011_1). We inferred 
confidence estimates using taxon jackknife resampling 
(in which sequences, rather than positions in 
the alignment, are resampled) as it provides a 
conservative guide to group monophyly, which we 
found greatly assists manual group name curation 
between tree iterations (see below). We then applied 
NCBI classifications to this topology using the 
tax2tree algorithm (Figure 1 and Materials and methods), also 
taking advantage of the explicit rank designations 
provided by NCBI to add rank prefixes to group 
names (for example, p(hylum)_, c(lass)_, 
o(rder)_). Most sequences (69%; 280 488 of 
408 135) had uninformative NCBI classifications, 
that is, no rank information below domain (kingdom; 
Figure 2a); of these, most were environmental 
clones designated as 'unclassified Bacteria'. The 
remaining 127 647 sequences with informative 
classifications were then applied to the tree. This 
'unamended' approach alone resulted in an improved 
classification, to at least phylum level, of 
nearly all of the taxonomically uninformative 
sequences (280 452 of 280 488) because most belong 
to known phyla but were simply deposited without 
classifications. 

We then overlaid additional taxonomic informa- 
tion onto the NCBI-annotated tree by manual group 
name curation in ARB. This information consisted 
primarily of candidate phyla and other rank desig- 
nations for environmental clusters imported from 
previous iterations of Greengenes (either assigned 
manually or by GRUNT (Dalevi et al., 2007)), and 
taxonomic information for the Cyanobacteria 
obtained from cyanoDB (http://www.cyanodb.cz/). 
This resulted in more informative classifications 
for 75% of sequences, by between one and six 
ranks. These increases in classification depth are 
shown graphically by rank in Figure 2a. 

Changes in sequence classifications between 
NCBI and Greengenes at each rank are summarized 
in Figure 2b. Most changes were from uninformative 
(domain/kingdom name only) in NCBI to informative 
in Greengenes, again reflecting the classification 
of the large fraction of unclassified environmental 
sequences in NCBI. The percentage of changes to 
informative NCBI classifications was relatively low 
(<7% for all ranks), indicating the degree of 
congruence between NCBI and Greengenes classifications, 
achieved in part by accommodating polyphyletic 
groups (see below). Of these types of 
changes, many occurred at the higher ranks, particularly 
in the candidate phyla, where Greengenes is 
manually curated most intensively (see below). A 
similar comparison with SILVA or RDP was not 
possible because of a lack of explicit ranks in these 
taxonomies. However, by comparing the group name 
immediately following the domain (kingdom) name 
(either Bacteria or Archaea) in the SILVA taxonomy, 
we estimated that only 28% of the 408 135 sequences 
lacked phylum-level classifications in SILVA, 
as opposed to 68% in NCBI. 

Further, the updated Greengenes taxonomy performed 
well in a test of reference taxonomies using a 
naive Bayesian classifier (Werner et al., 2011). In 
that paper, we found that retraining the RDP 
classifier (Wang et al., 2007) using taxa from the 
new Greengenes taxonomy resulted in increased 
classification resolution relative to SILVA or RDP 
for a range of environments (human body habitats, 
snake and mouse gut and soils). 

[Figure 1 graphic not reproduced. Panel (i) lists the input taxonomy for tips A-L: A and C-F, f_Lachnospiraceae; g_Clostridium; s_. B, Unclassified. G and H, f_Lachnospiraceae; g_Dorea; s_. I and J, f_Lachnospiraceae; g_Clostridium; s_Clostridium bolteae. K and L, f_Lachnospiraceae; g_Clostridium; s_Clostridium hylemonae.] 

Figure 1 Overview of the tax2tree workflow. (i) The inputs to tax2tree: a taxonomy file that matches known taxonomy strings to 
identifiers that are associated with tips of (that is, sequences within) a phylogenetic tree. To simplify the diagram, only the family, genus 
and species are used, although the full algorithm uses all phylogenetic ranks. (ii) The input taxonomy represented as a tree and a taxon 
name legend for the figure. (iii-v) Nodes chosen by the F-measure procedure at each rank: (iii) species, (iv) genus and (v) family. In this 
example, the genus Clostridium is polyphyletic, and the F-measure procedure picked the 'best' internal node for the name (uniting tips 
A-F). However, as unique names at a given rank can only be placed once on the tree, this leaves tips I-L without a genus name placed on 
an interior node. (vi) The backfilling procedure detects that tips I-L have an incomplete taxonomic path (species to family) and 
prepends the missing genus name (obtained from the input taxonomy) to the lower rank because this step of the procedure examines only 
ancestors but not siblings. (vii) The common name promotion step identifies internal nodes in which all of the nearest named 
descendants share a common name. In this example, the node that is the lowest common ancestor for tips I-L has immediate descendants 
that all share the same genus name, Clostridium. This name can be safely promoted to the lowest common ancestor (interior node) 
uniting tips I-L. (viii) The resulting taxonomy. Note that the sequence identified as B was unclassified in the donor taxonomy but is now 
classified as f_Lachnospiraceae; g_Clostridium; s_. 

Figure 2 A comparison of the NCBI taxonomy to the updated Greengenes taxonomy for sequences in tree_16S_all_gg_2011_1. 
(a) Lowest taxonomic rank assigned to each sequence; (b) taxonomic differences between NCBI and Greengenes at each rank, showing 
the percentage of sequences classified to each of five possible categories (see inset legend; GG, Greengenes) highlighting cases where 
NCBI and Greengenes differ. 

The value of accommodating polyphyletic groups 
in a 16S rRNA-based taxonomy 

In principle, every taxon should correspond to 
a single monophyletic group in the 16S rRNA-based 
taxonomy, but practical considerations 
make relaxing this constraint very useful. In our 
approach, the backfilling step in tax2tree (see 
Materials and methods) allows multiple groups to be given the 
same rank name. This feature is important for 
taxonomic groups that are well established in the 
literature but polyphyletic in the reference 16S 
rRNA tree. A prominent example is the class 
Deltaproteobacteria, which rarely forms a monophyletic 
group in large 16S rRNA topologies and 
comprises five groups in the current 
tree_16S_all_gg_2011_1. This result may indicate that the 
Deltaproteobacteria do not form an evolutionarily 
coherent grouping and will need to be subdivided and 
reclassified. Alternatively, the Deltaproteobacteria 
may be a monophyletic group not resolved in 16S 
rRNA trees because of tree inference artifacts, chimeric 
sequences and/or limits in the phylogenetic 
resolution of trees constructed from the 16S rRNA 
molecule alone. This issue can best be addressed 
using 'whole genome' tree approaches that have 
greater phylogenetic resolution than single-gene 
topologies. Trees based on a concatenation of 31 
conserved, nearly ubiquitous single-copy gene families 
indicate that the small subset of Deltaproteobacteria 
with genome sequences is monophyletic (Ciccarelli 
et al., 2006; Wu et al., 2009). A second example, 
also based on concatenated conserved marker genes, 
indicates that the first genome sequence representative 
of the order Halanaerobiales (Halothermothrix 
orenii) is a member of the Firmicutes phylum 
(Mavromatis et al., 2009), which is its (contested) 
classification based on 16S rRNA trees (Ludwig and 
Klenk, 2001). Indeed, the Halanaerobiales is separate 
from the Firmicutes in the current Greengenes 
topology and is only classified as such because of 
the backfilling procedure. 

Similarly, a number of phylum-level associations 
have been suggested based on concatenated gene 
topologies including a relationship between the 
class Deltaproteobacteria and phylum Acidobacteria 
(Ciccarelli et al., 2006). We predict that at least a 
subset of the currently defined candidate phyla will 
coalesce with other phyla once they are adequately 
represented by genome sequences and whole- 
genome trees can be constructed. Therefore, current 
estimates of the number of 16S-based candidate 
phyla (>50) should only be used as an approxima- 
tion and may well drop as candidatus genome 
sequences accrue. However, regardless of absolute 
number of phyla, there is a strong need for 
consistent delineation of taxonomic groups between 
public databases, particularly candidate phyla. 
Thus, by retaining polyphyletic groups as sets of 
monophyletic taxa with the same name, we can 
accommodate the uncertainty in our present knowl- 
edge about both the tree and taxonomy and easily 
propagate whole-genome-based classification im- 
provements in subsequent iterations of de novo 
16S rRNA-based trees. 



Reconciliation of NCBI and Greengenes-defined 
candidate phyla 

At the time of writing, the NCBI taxonomy lists 71 
candidate phyla (divisions), of which 30 are represented 
only by partial (<1200 nt) sequences. Therefore, 
in order to address the classification of these 
groups, we amended tree_16S_all_gg_2011_1 with 
765 mostly partial length sequences from GenBank 
and generated a new de novo tree using FastTree, 
tree_16S_candiv_gg_2011_1. We found some NCBI 
candidate phyla to be polyphyletic in 
tree_16S_candiv_gg_2011_1 because of a small number of 
submitter misclassifications or chimeric artifacts. In 
these instances, we reconciled NCBI and Greengenes 
designations using the majority classification 
for a given NCBI group. On the basis of 
tree_16S_candiv_gg_2011_1, we resolved the 71 candidate 
phyla into 45 monophyletic groups and one chimeric 
artifact (Table 1). Many proposed phyla 
appear to belong to well-established lineages, including 
the Proteobacteria, Firmicutes, Bacteroidetes, 
Chloroflexi and Spirochaetes. In cases where 
two or more NCBI candidate phyla were combined 
and did not cluster with more established groups, 
we gave priority to the oldest and/or the 
largest group. For example, we reclassified candidate 
phylum kpj58rc (Kelly and Chistoserdov, 2001) 
as OP3 (Hugenholtz et al., 1998) because of the 
priority of OP3 in the literature and the larger number of 
representative OP3 sequences in the public databases. 
We also compared our classifications to 
SILVA and RDP and in many instances saw 
consistencies. For example, Greengenes and SILVA 
both classify candidate phylum CAB-I as Cyanobacteria, 
and Greengenes and RDP both classify KSA1 in 
the Bacteroidetes. Conversely, in some instances we 
saw disagreements, such as candidate phylum GN02 
(Ley et al., 2006) being classified as BD1-5, and WS5 
(Dojka et al., 1998) as WCHB1-60, by SILVA (Table 1). 
This points to the need to consolidate classifications 
and also to give priority to published group names 
where possible. 
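
The majority rule used to reconcile a polyphyletic NCBI group can be sketched as a simple vote over the placements of its representative sequences. This is a hypothetical illustration, not the curation procedure itself (which was manual): the function name, the example placements and the choice to return None on a tie are assumptions.

```python
from collections import Counter


def majority_classification(placements):
    """Return the most common phylum-level placement, or None on a tie.

    `placements` holds one phylum name per representative sequence of an
    NCBI candidate phylum; empty entries are ignored.
    """
    counts = Counter(p for p in placements if p)
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # no majority; such a group needs manual attention
    return ranked[0][0]


if __name__ == "__main__":
    # Hypothetical group with one misplaced representative.
    print(majority_classification(["p_GN02", "p_GN02", "p_OD1"]))
    # One representative in each of two phyla: no majority.
    print(majority_classification(["p_Spirochaetes", "p_SAR406"]))
```

A tie corresponds to cases such as Sediment-2 in Table 1, where one representative belongs to each of two phyla and both names are reported.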



Table 1 Greengenes classifications of NCBI-defined candidate phyla (divisions) based on tree_16S_candiv_gg_2011_1. SILVA_106 and 
RDP classifications are included for reference 

Columns 3 to 5 give the consensus phylum-level classification (c). Cells marked [illegible] could not be recovered from the source. 

| Candidate bacterial division (phylum) in the NCBI taxonomy (a) | Number of NCBI representative sequences; full (partial) (b) | Greengenes | SILVA | RDP |
|---|---|---|---|---|
| AC1 | 6 (7) | p_AC1 (d) | [illegible] | |
| OS-K | 3 (7) | p_Acidobacteria (d) | Acidobacteria | Acidobacteria |
| OP10 | 69 (279) | p_Armatimonadetes | OP10 | OP10 |
| KSA1 | 0 (2) | p_Bacteroidetes (d) | | Bacteroidetes |
| KSB1 | 13 (23) | p_Caldithrix | Deferribacteres | |
| MSBL5 | 0 (1) | p_Chloroflexi | | Chloroflexi |
| NT-B4 | 0 (1) | p_Chloroflexi | | |
| CAB-I | 7 (59) | p_Cyanobacteria | Cyanobacteria | Cyanobacteria |
| OP2 | 1 (25) | p_Elusimicrobia (e) | Thermotogae | |
| GN01 | 10 (12) | p_GN01 | Spirochaetes | |
| GN02 | 4 (10) | p_GN02 | BD1-5 | |
| GN10 | 3 (4) | p_GN02 | [illegible] | |
| GN11 | 3 (0) | p_GN02 | [illegible] | |
| GN07 | 0 (4) | p_GN02 | | |
| GN08 | 0 (1) | p_GN02 | | |
| GN04 | 7 (7) | p_GN04 | [illegible] | |
| GN12 | 0 (2) | p_GN04 | | |
| GN15 | 0 (2) | p_GN04 | | |
| GN13 | 0 (2) | p_GN13 | | |
| GN14 | 0 (2) | p_GN14 | | |
| GN06 | 1 (2) | p_KSB3 | Proteobacteria | |
| NC10 | 6 (27) | p_NC10 | Nitrospirae | Firmicutes |
| NKB19 | 4 (11) | p_NKB19 | | |
| KB1 group | 7 (20) | p_OP1 | EM19 | |
| OP1 | 10 (38) | p_OP1 | EM19 | |
| MSBL6 | 0 (5) | p_OP1 | | |
| Sediment-3 | 0 (1) | p_OP1 | | |
| MSBL4 | 0 (3) | p_OP3 | | |
| kpj58rc | 0 (1) | p_OP3 | | |
| OP8 | 36 (390) | p_OP8 | Nitrospirae | |
| JS1 | 26 (89) | p_OP9 | [illegible] | Firmicutes |
| VC2 | 0 (2) | p_Proteobacteria (d) | | |
| Marine group | 0 (2) | p_SAR406 | | |
| SBR1093 | 9 (1) | p_SBR1093 | Proteobacteria | |
| SPAM | 8 (1) | p_SPAM | Nitrospirae | |
| GN05 | 4 (9) | p_Spirochaetes (d) | Spirochaetes | |
| WWE1 | 3 (2) | p_Spirochaetes (d) | Spirochaetes | |
| OP4 | 1 (1) | p_Spirochaetes (d) | Spirochaetes | |
| MSBL2 | 0 (6) | p_Spirochaetes (d) | | |
| KSA2 | 0 (1) | p_Spirochaetes (d) | | |
| Sediment-4 | 0 (3) | p_Spirochaetes (d, f) | | |
| Sediment-2 | 0 (2) | p_Spirochaetes (d); p_SAR406 (g) | | |
| GN09 | 6 (4) | p_TG3 | Fibrobacteres | |
| TG3 | 41 (40) | p_TG3 | Fibrobacteres | |
| MSBL3 | 0 (1) | p_Verrucomicrobia (d) | | |
| Sediment-1 | 0 (3) | p_WS3 | | |
| GN03 | 0 (27) | p_WS3 | | |
| KSB4 | 0 (1) | p_WS3 | | WS3 |
| WS5 | 1 (2) | p_WS5 | WCHB1-60 | |
| WWE3 | 116 (0) | p_WWE3 | OD1 | |
| ZB3 | 11 (0) | p_ZB3 | Cyanobacteria | |
| TG2 | 4 (0) | p_ZB3 | Cyanobacteria | |
| SAM | 1 (0) | Chimera (h) | Chloroflexi | |

Abbreviation: NCBI, National Center for Biotechnology Information. 

(a) The following candidate phyla are not shown because they were consistent between NCBI, Greengenes, SILVA and RDP (where classifications 
were available): BRC1, KSB2, KSB3, OD1, OP11, OP3, OP6, OP7, OP9, SR1, TM6, TM7, WS1, WS2, WS3, WS4, WS6 and WYO. 
(b) Full-length representatives >1200 nt, partial length <1200 nt; not all sequences are 16S rRNA. Phylogenetic placements based only on partial 
sequences should be considered probationary until full-length or genomic sequence data become available. 
(c) Name of phylum that encompasses the majority of the NCBI representative sequences, except where specifically noted. Gaps indicate no 
classification. 
(d) Not robustly supported as a monophyletic group in tree_408135 (jackknife <70%). 
(e) On the basis of the position of the single full-length representative after which the group was originally named; the 25 partial length 
representatives are not affiliated with the full-length sequence and belong to the Chlorobi. 
(f) On the basis of the longest representative of this proposed group (AF142890); the two shorter sequences are members of the Firmicutes. 
(g) One representative belongs to each phylum: AF142866, Spirochaetes; AF142828, SAR406. 
(h) Between Planctomycetes and Chloroflexi. 

Final comments and prospectus 

The new NCBI-reconciled Greengenes taxonomy 
rescues over 200 000 environmental sequences from 
unclassified oblivion. Moreover, the tax2tree pipeline 
will assist in reconciling information among the 
various 16S rRNA resources (Greengenes, SILVA, 
RDP and EzTaxon) with phylogenetic trees built 
using different methods, and will, we hope, make it 
easier for users to reconcile taxonomic classifications 
of large data sets obtained using different 
taxonomic schemes. This is especially important 
because the choice of taxonomy can have a larger 
effect on the results than the choice of assignment 
method (Liu et al., 2008). The new Greengenes 
taxonomy, along with all intermediate data products 
including the tree, can be downloaded from the 
Greengenes web site at http://greengenes.lbl.gov/. 

This work, by automating the process of improv- 
ing the tree and allowing import of taxonomic 
knowledge from elsewhere, provides the first step 
toward an automated pipeline that will immensely 
improve our ability to link organisms to environ- 
ment and to understand the evolutionary change 
associated with phenotypic changes such as adapta- 
tion to a new host, switching to a new habitat or 
adapting to use a new substrate. By providing the 
foundation for organizing microbial knowledge, 
these expanded taxonomies will greatly improve 
our ability to understand the microbes that pervade 
all aspects of life on Earth. 